Create a fully automated YouTube video with Text-to-speech services & comparison of Amazon Polly vs. IBM Watson

Simona

2020.05.06

This article was published more than a year ago. Please note that the information may be out of date.

Intro.

Text-to-speech (TTS) is a popular area in machine learning. As the technology evolves, the number of TTS options has increased drastically. In recent years, cloud computing companies have improved TTS alongside the growth of big data and artificial intelligence applications. I will compare the two TTS services that I used to create AWS tutorials on YouTube. The tutorials can be found in this playlist.

Nowadays, the big cloud computing companies provide APIs for their speech services, which makes them easy to use. In contrast to open-source TTS services, the TTS APIs provided by cloud computing companies ensure that personal data remains within the user's account. In this article, I will share my experience with Amazon Polly and IBM Watson. Note that I used the demo version of IBM Watson, and no personal data was involved.

1. IBM Watson

The first 3 videos were created with the IBM Watson Text to Speech service. This is the link to the demo version I used to create the tutorial videos.

https://text-to-speech-demo.ng.bluemix.net/

Features:

14 languages & variations — 27 voices (13 neural and 14 standard) across 7 languages

Pricing:

Lite plan: 500 minutes per month free
Standard plan: from $0.02 USD/minute

Pro

Doesn’t require creating an account
Source code can be forked from GitHub

Con

Cannot resolve abbreviations such as AWS or IAM. Workaround: type “A” “W” “S” to force IBM Watson to spell out each letter
The downloaded file doesn’t come with a file extension, so you have to append “.mp3” to each “synthesize” file manually
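
Both workarounds above are easy to script. The following is only a minimal sketch using the Python standard library. The helper names `spell_out` and `add_mp3_extension` are hypothetical, and it assumes that separating the letters with spaces has the same effect as typing them one by one in the demo, and that the downloaded files sit together in one folder.

```python
import re
from pathlib import Path


def spell_out(text, abbreviations):
    """Replace each abbreviation with space-separated letters (AWS -> A W S)."""
    for abbr in abbreviations:
        spaced = " ".join(abbr)
        # \b restricts the match to whole words, so longer words stay intact.
        text = re.sub(rf"\b{re.escape(abbr)}\b", spaced, text)
    return text


def add_mp3_extension(folder):
    """Append ".mp3" to every file in `folder` that has no extension."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix == "":
            target = path.with_name(path.name + ".mp3")
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

For example, `spell_out("Log in to AWS and open IAM.", ["AWS", "IAM"])` returns `"Log in to A W S and open I A M."`, which can then be pasted into the demo page.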

An example of speech using IBM Watson can be found in this video:

2. Amazon Polly

Amazon Polly doesn’t have a demo site, so you need to log in to an AWS account to use it. If you haven't signed up for an AWS account yet, you can follow this tutorial to create one:

Features:

29 languages & variations
Standard TTS voices, and Neural Text-to-Speech (NTTS) voices that improve speech quality for more natural and human-like voices.
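
Besides the console, Polly can also be called from code. The sketch below uses boto3 (the AWS SDK for Python) and its `synthesize_speech` call to show where the choice between a standard and a neural voice is made. It is an illustration rather than the workflow used for the videos: the helper names are hypothetical, the voice `Joanna` is just an example, and it assumes AWS credentials and a region are already configured. Not every voice or region supports the neural engine.

```python
def build_polly_request(text, voice_id="Joanna", neural=False):
    """Build the keyword arguments for Polly's SynthesizeSpeech operation."""
    return {
        "Text": text,
        "OutputFormat": "mp3",
        "VoiceId": voice_id,  # example voice, chosen only for illustration
        "Engine": "neural" if neural else "standard",
    }


def synthesize_to_file(text, out_path, **options):
    """Call Polly and write the MP3 audio to out_path (needs AWS credentials)."""
    import boto3  # imported lazily so build_polly_request works without boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_polly_request(text, **options))
    with open(out_path, "wb") as audio_file:
        # AudioStream is a streaming body that holds the synthesized audio.
        audio_file.write(response["AudioStream"].read())
```

A call such as `synthesize_to_file("Welcome to this AWS tutorial.", "intro.mp3", neural=True)` would then produce a file that already has the right extension.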

Pricing:

Pay-as-you-go model: Standard voices $4.00 USD per 1M characters, Neural voices $16.00 USD per 1M characters
Free Tier: Standard voices 5M characters per month, Neural voices 1M characters per month, for the first 12 months starting from your first request for speech
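
To get a feel for these numbers, here is a rough estimate of a monthly bill based only on the prices listed above. It is a simplified sketch: the function name is hypothetical, it assumes billing is strictly linear in the number of characters, and it assumes the free allowance is simply subtracted from the monthly total.

```python
# Prices and free-tier allowances as listed in this article.
PRICE_PER_MILLION_USD = {"standard": 4.00, "neural": 16.00}
FREE_TIER_CHARS_PER_MONTH = {"standard": 5_000_000, "neural": 1_000_000}


def estimate_monthly_cost(characters, engine="standard", free_tier=True):
    """Estimate the monthly Polly cost in USD for a number of characters."""
    free = FREE_TIER_CHARS_PER_MONTH[engine] if free_tier else 0
    billable = max(0, characters - free)  # never bill a negative amount
    return billable * PRICE_PER_MILLION_USD[engine] / 1_000_000
```

Under these assumptions, 2 million characters with a neural voice during the free-tier period would cost `estimate_monthly_cost(2_000_000, "neural")`, that is 16.0 USD, while 3 million characters with a standard voice would stay within the free allowance.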

Pro